Abstract

All practical applications contain some degree of non- 
determinism. When such applications are replicated to 
achieve Byzantine fault tolerance (BFT), their nondetermin- 
istic operations must be controlled to ensure replica consis- 
tency. To the best of our knowledge, only the most simplis- 
tic types of replica nondeterminism have been dealt with. 
Furthermore, a systematic approach to handling common types
of nondeterminism is lacking. In this paper, we propose
a classification of common types of replica nondeterminism 
with respect to the requirement of achieving Byzantine fault 
tolerance, and describe the design and implementation of 
the core mechanisms necessary to handle such nondeter- 
minism within a Byzantine fault tolerance framework. 

Keywords: Byzantine Fault Tolerance, Intrusion Tolerance, 
Replica Nondeterminism, Security, Fault Tolerance Middle- 
ware 

1. Introduction 

Today's society increasingly relies on services provided
over the Internet. These services are expected to be
highly dependable, which requires the applications provid- 
ing such services to be carefully designed and implemented, 
and rigorously tested. However, considering the intense 
pressure for short development cycles and the widespread 
use of commercial-off-the-shelf software components, it is 
not surprising that software systems are notoriously imper- 
fect. The vulnerabilities due to insufficient design and poor 
implementation are often exploited by adversaries to cause a
variety of damage, e.g., crashing the applications, leaking
confidential information, modifying or deleting
critical data, or injecting erroneous information into the
application data. These malicious faults are often modeled
as Byzantine faults. One approach to tackle such threats 
is to replicate the server-side applications and employ a 
Byzantine fault tolerance (BFT) algorithm.

Byzantine fault tolerance algorithms require the replicas
to operate deterministically, i.e., given the same input un- 
der the same state, all replicas produce the same output and 
transit to the same state. However, all practical applications 
contain some degree of nondeterminism. When such appli- 
cations are replicated to achieve fault and intrusion toler- 
ance, their nondeterministic operations must be controlled 
to ensure replica consistency. 

To the best of our knowledge, only the most simplistic 
types of replica nondeterminism have been dealt with under 
the Byzantine fault model [7] [8] [9] [10], which we term
wrappable nondeterminism and verifiable pre-determinable 
nondeterminism. The former assumes that any nondeter- 
ministic operations and their side effects can be mapped into 
some pre-specified abstract operations and state, which are 
deterministic. The latter assumes that any nondeterministic
values can be determined prior to the execution of a request, 
and such values proposed by one replica can be verified by 
other replicas in a deterministic manner, and the values are 
accepted only if they are believed to be correct. 

The mechanisms designed to handle these types of nondeterminism
are not effective in guaranteeing replica
consistency, in masking Byzantine
faults, or both, if the application to be replicated exhibits other types
of nondeterministic behavior. For example, many online
gaming applications contain nondeterminism whose values 
(e.g., random numbers that determine the state of the applications)
proposed by one replica cannot be verified by
another replica. It is dangerous to treat this type of nonde- 
terminism the same as the verifiable pre-determinable non- 
determinism because a faulty replica could use a predictable 
algorithm to update its internal state and collude with its 
clients, without being detected, which defeats the purpose 
of Byzantine fault tolerance. As another example, multi- 
threaded applications may exhibit nondeterminism whose 
values (e.g., thread interleaving) cannot be determined prior
to the execution of a request (without losing concurrency), 
which cannot be handled by existing BFT mechanisms. 

In this paper, we introduce a classification of common 
types of replica nondeterminism present in many applications.
We propose a set of mechanisms that can be
used to control these types of nondeterministic operations.
We also describe the implementation of the core mechanisms
and their integration with a well-known BFT framework
[7] [8] [9] [10]. Our performance evaluation of the integrated
framework shows that our mechanisms introduce only
moderate runtime overhead.

Figure 1. The positioning of the application
and the BFT library, and the core interfaces
between the two components.

2. Byzantine Fault Tolerance 

This work is built on top of the BFT framework developed
by Castro, Rodrigues, and Liskov [7] [8] [9] [10]. We use
the same assumptions and system models as those of the 
BFT framework. For completeness, we briefly summarize 
the BFT framework here. 

The BFT framework supports client-server applications 
running in an asynchronous distributed environment with a 
Byzantine fault model, i.e., faulty nodes may exhibit arbitrary
behaviors. It requires the use of 3f+1 replicas to tolerate
up to f faulty nodes. (In a recent publication [19], Yin et
al. proposed a method to reduce the number of replicas to
2f+1 by separating the execution and agreement nodes.)

The BFT framework is implemented as a library to be 
linked to the application code (both the server and the client 
sides), as shown in Figure 1. In general, on the server side,
we use the term replica to refer to the combined entity of 
the server application and the BFT library. On the client 
side, we use the term client to refer to the combined en- 
tity of the client application and the client-side BFT library. 
Sometimes, however, it is necessary to distinguish the two 
parts explicitly. As shown in Figure 1, the client-server application
and the BFT mechanisms (residing in the BFT
library) interact via a set of Application Programming In- 
terfaces (APIs). The APIs contain a number of downcalls 
to be invoked by the application for a number of purposes, 
for example, to initialize the BFT library with appropriate 
parameters and callback functions, to send requests to the 
server replicas, and to start the event loop managed by the 
BFT library. The APIs also contain a number of upcalls to 
be implemented and supplied by the application, so that the 
BFT mechanisms can deliver a request to the server application,
retrieve and verify nondeterministic values (if applicable),
and retrieve and restore application state. Figure 1
includes a subset of the APIs directly related to this work.

In the BFT framework, a replica is modeled as a state
machine. The replica is required to run (or rendered to run)
deterministically. The state change is triggered by remote 
invocations on the methods offered by the replica. In gen- 
eral, the client first sends its request to the primary replica. 
The primary replica then broadcasts the request message to 
the backup replicas and also determines the execution order 
of the message. All correct replicas must agree on the same 
set of request messages with the same execution order. In 
other words, the request messages must be delivered to the 
server application at all replicas reliably in the same total 
order. 

In the BFT framework, a very efficient Byzantine agree- 
ment algorithm, often referred to as the BFT algorithm, is 
used to ensure the total ordering of the requests received 
from different clients. The normal operations of the BFT 
algorithm involve three phases. The first phase is called 
the pre-prepare phase, where the primary replica multicasts 
a pre_prepare message containing the ordering information,
the client's request, and the nondeterministic values
that can be determined prior to the execution of the re- 
quest (if any) to all backup replicas. A backup replica then 
verifies the ordering information, the nondeterministic val- 
ues, and the validity of the request message. If the backup 
replica accepts the pre_prepare message, it multicasts to 
all other replicas a prepare message containing the or- 
dering information and the digest of the request message 
being ordered. This starts the second phase, i.e., the pre- 
pare phase. When a replica has collected 2f valid prepare 
messages for the request from other replicas, it multicasts a 
commit message. This is the start of the third phase. When 
a replica has received 2f matching commit messages from 
other replicas, the request message has been totally ordered 
and it is ready to be delivered to the server application. This 
concludes the third phase, i.e., the commit phase, of the 
BFT algorithm. In the BFT framework, all messages are 
protected by a digital signature, or an authenticator [6], to
ensure their integrity. 
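
The replica count and the quorum sizes used in the three phases can be made concrete with a small sketch. The type and function names below (`Quorum`, `prepared`, `committed`) are illustrative assumptions and are not identifiers from the BFT library.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: quorum arithmetic for a group that tolerates f faults.
struct Quorum {
    std::size_t f;  // maximum number of faulty replicas tolerated

    // Total number of replicas required by the BFT algorithm: 3f + 1.
    std::size_t replicas() const { return 3 * f + 1; }

    // A replica multicasts its commit message once it has collected
    // 2f valid prepare messages from other replicas.
    bool prepared(std::size_t prepares_from_others) const {
        return prepares_from_others >= 2 * f;
    }

    // A request is totally ordered once 2f matching commit messages
    // from other replicas have been received.
    bool committed(std::size_t commits_from_others) const {
        return commits_from_others >= 2 * f;
    }
};
```

For f = 1, this gives four replicas, and each replica waits for two prepare and two commit messages from its peers before delivering the request.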

3. Classification of Replica Nondeterminism 

We classify replica nondeterminism into the following
three major categories:

• Wrappable nondeterminism. This type of replica non- 
determinism can be easily controlled by using an 
infrastructure-provided or application-provided wrap- 
per function, without explicit inter-replica coordina- 
tion. For example, information such as hostnames, 
process ids, file descriptors, etc. can be determined 
group-wise. Another situation is when all replicas are 
implemented according to the same abstract specifica- 
tion, in which case, a wrapper function can be used 
to translate between the local state and the group-wise 
abstract state, as described in [10].



• Pre-determinable nondeterminism. This is a type of
replica nondeterminism whose values can be known 
prior to the execution of a request and it requires inter- 
replica coordination to ensure replica consistency. 

• Post-determinable nondeterminism. This is a type 
of replica nondeterminism whose values can only be 
recorded after the request is submitted for execution 
and the nondeterministic values are not complete until
the end of the execution. It also requires inter-replica
coordination to ensure replica consistency.

In this paper, we do not discuss wrappable replica
nondeterminism further because it can be dealt
with using a deterministic wrapper function without inter-replica
coordination, and also because it has been thoroughly
studied in [10]. Instead, we focus on the remaining
two types of replica nondeterminism.

Based on whether a replica can verify the nondeterministic
values proposed (or recorded) by another replica, replica 
nondeterminism can be further classified into the following 
types: 

• Verifiable nondeterminism. The type of replica nonde- 
terminism whose values can be verified by other repli- 
cas. 

• Non-verifiable nondeterminism. The type of replica 
nondeterminism whose values cannot be completely 
verified by other replicas. Note that a replica might be 
able to partially verify some nondeterministic values 
proposed by another replica. This would help reduce 
the impact of a faulty replica. 

Overall, our classification gives four types of replica
nondeterminism of interest:

• Verifiable pre-determinable nondeterminism (VPRE). 
In the past, clock-related operations have been treated 
as operations of this type. However, strictly speaking,
it is not possible for a replica to verify deterministically
another replica's proposal for the current clock
value without imposing stronger restrictions on the synchrony
of the distributed system (e.g., bounds on message
propagation and request execution).

• Non-verifiable pre-determinable nondeterminism 
(NPRE). Online gaming applications, such as Blackjack
[1] and Texas Hold'em [18], exhibit this type of
nondeterminism. The integrity of services provided by 
such applications depends on the use of good secure 
random number generators. For best security, each
replica's choice of a random number must be
unpredictable, and hence cannot be verified by other replicas.

• Verifiable post-determinable nondeterminism 
(VPOST). We have yet to identify a commonly used
application that exhibits this type of nondeterminism.
We include this type for completeness.

• Non-verifiable post-determinable nondeterminism 
(NPOST). In general, all multithreaded applications 
exhibit this type of nondeterminism. For such appli- 
cations, it is virtually impossible to determine which 
thread ordering should be used prior to the execution 
of a request without losing concurrency. 

4. Controlling Replica Nondeterminism 

In this section, we present core mechanisms for controlling
replica nondeterminism for Byzantine fault tolerance,
and provide a brief informal proof of the correctness of our
mechanisms. Our mechanisms rely on the same set of APIs
as those in the original BFT library to retrieve nondeterministic
values from, upload them to, and have them verified by the
application, albeit with some modifications to the parameter
list. The most relevant APIs are shown in Figure 1.
Due to space limitations, we omit the detailed explanation of
these APIs.

Our mechanisms work as follows. When the
primary receives a client's request and is ready to order
the message, it invokes the propose_value() callback
function registered by the application layer. The application
supplies the type of nondeterminism that would be involved
in the execution of the request and, if applicable,
the nondeterministic values. Depending on the type of nondeterminism
returned by the application, the modified BFT
algorithm operates differently according to the mechanisms
described in Section 4.1 through Section 4.4.

In practical applications, the execution of a request
often involves more than one type of nondeterminism,
for example, both time-related nondeterminism
(which is of the verifiable pre-determinable type) and
multithreading-related nondeterminism (which is of the
non-verifiable post-determinable type). To accommodate this
complexity, a bitmask should be used instead of an integer
value to capture the nondeterminism type information in
the propose_value() and check_value() upcalls.
However, the data structure used to store the nondeterministic
values does not need to be made more sophisticated
because it is the application's duty to generate and interpret
them. Our algorithm can readily cope with this complexity.
Using the same example, the time-related nondeterministic
values can be determined during the pre-prepare-update
phase. The multithreading-related nondeterminism can be
resolved in the post-commit phase.
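
A minimal sketch of such a bitmask is given here. The flag values and helper names (`NdetType`, `NdetMask`, `has_type`, and so on) are hypothetical; the actual encoding used in the modified upcalls is not specified in this paper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values, one bit per nondeterminism type.
enum NdetType : std::uint8_t {
    NDET_NONE  = 0,
    NDET_VPRE  = 1u << 0,  // verifiable pre-determinable
    NDET_NPRE  = 1u << 1,  // non-verifiable pre-determinable
    NDET_VPOST = 1u << 2,  // verifiable post-determinable
    NDET_NPOST = 1u << 3   // non-verifiable post-determinable
};

// A request may carry any combination of the flags above.
using NdetMask = std::uint8_t;

inline bool has_type(NdetMask mask, NdetType t) { return (mask & t) != 0; }

// NPRE requires the extra pre-prepare-update phase (Section 4.2).
inline bool needs_pre_prepare_update(NdetMask mask) {
    return has_type(mask, NDET_NPRE);
}

// VPOST and NPOST require the post-commit phase (Sections 4.3 and 4.4).
inline bool needs_post_commit(NdetMask mask) {
    return has_type(mask, NDET_VPOST) || has_type(mask, NDET_NPOST);
}
```

With this encoding, the composite example in the text (time-related plus multithreading-related nondeterminism) would be expressed as `NDET_VPRE | NDET_NPOST`.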

4.1. Controlling VPRE nondeterminism 

If the nondeterminism for the operation at the primary is
of type vpre, the application provides the nondeterministic
values in the ndet parameter. The obtained information is
included in the pre_prepare message, and it is multicast
to the backup replicas.

On receiving the pre_prepare message, a backup
replica invokes the check_value() callback function.
The replica passes the information received regarding the
nondeterminism type and data values to the application
layer so that the application can verify that (1) the type of
nondeterminism for the client's request is consistent with what is
reported by the primary, and (2) the nondeterministic values
proposed by the primary are consistent with its own values.
If either check fails, the check_value()
call returns an error code and the backup replica suspects
the primary. Otherwise, the backup replica accepts the
client's request and the ordering information specified by
the primary, logs the pre_prepare message, and multicasts
a prepare message to all other replicas. From now on, the
algorithm works the same as that of the original BFT framework,
with the exception that the prepare and commit
messages also carry the digest of the nondeterministic values.
The normal operation of the modified BFT algorithm
is illustrated in Figure 2.

Figure 2. Normal operation of the modified BFT
algorithm in handling verifiable (left) and nonverifiable
(right) pre-determinable nondeterminism.

4.2. Controlling NPRE nondeterminism

If the nondeterminism for the operation at the primary
is of type npre, the application at the primary proposes
its share of nondeterministic values. The type of nondeterminism
and the nondeterministic values are included in
the pre_prepare message, and it is multicast to all backup
replicas.

On receiving the pre_prepare message, a backup
replica invokes the check_value() callback function to
verify the nondeterminism type information supplied by the
primary replica (after it has verified the client's request and
the ordering information). If the verification is successful,
the backup replica invokes the propose_value() function
to obtain its share of nondeterministic values. It then
builds a pre_prepare_update message including its own
nondeterministic values, and sends the message to the primary.

When the primary receives 2f pre_prepare_update
messages from different backup replicas (for the same
client's request), it builds a pre_prepare_update message
that includes the 2f+1 sets of nondeterministic values
(its own and those of the 2f backups),
each protected by the proposer's digital signature or authenticator.
The pre_prepare_update message itself is
further protected by the primary's signature or authenticator.
The primary then multicasts the message to all backup
replicas. From now on, the BFT algorithm operates according
to the original algorithm, except that the prepare and
commit messages also carry the digest of the nondeterministic
values, and the 2f+1 sets of nondeterministic values are
delivered to the application layer as part of the execute()
upcall. The normal operation of the modified BFT algorithm
for this type of nondeterminism is illustrated in Figure 2.
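
How the application derives the final value from the 2f+1 delivered sets is left to the application. As one possible illustration, the sketch below combines the shares with a bytewise XOR, so that the result does not depend on any single replica's contribution alone. The XOR rule and the names `Share` and `combine_shares` are assumptions made for this example, not part of the mechanism described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One replica's proposed nondeterministic values, as raw bytes.
using Share = std::vector<std::uint8_t>;

// Combine exactly 2f+1 equally sized shares by bytewise XOR.
// Returns false (leaving `out` unspecified) if the number of shares
// is not 2f+1 or if the shares differ in length.
// Assumption: XOR is only one possible combination rule.
bool combine_shares(const std::vector<Share>& shares, std::size_t f, Share& out) {
    if (shares.size() != 2 * f + 1) return false;
    out.assign(shares.front().size(), 0);
    for (const Share& s : shares) {
        if (s.size() != out.size()) return false;
        for (std::size_t i = 0; i < s.size(); ++i) out[i] ^= s[i];
    }
    return true;
}
```

Because every correct replica receives the same 2f+1 sets through the execute() upcall, applying the same deterministic combination rule yields the same final value at each of them.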

4.3. Controlling VPOST nondeterminism 

The normal operation of the modified BFT algorithm
in handling this type of replica nondeterminism is shown
in Figure 3. The primary includes the nondeterminism
type (i.e., vpost) information in the pre_prepare message
without any nondeterministic values and multicasts the
message to the backup replicas.

On receiving the pre_prepare message, a backup
replica performs the check_value() upcall if it has verified
the client's request and the ordering information. If
the backup replica confirms the type of nondeterminism, the
BFT algorithm proceeds to the commit phase as usual. Otherwise,
the backup replica suspects the primary.

Figure 3. Normal operations of the modified BFT algorithm in handling verifiable (left) and nonverifiable
(right) post-determinable nondeterminism.

When the primary is ready to deliver the request message,
it performs the execute() upcall and
expects to receive both the reply message and the recorded
nondeterministic values. Once the upcall returns, the primary
stores the retrieved post-determined nondeterministic
values, together with the digest of the reply, into a postnd
log, and sends the reply message to the client. The digest
of the reply is included so that a backup replica can verify
whether the primary has actually used the nondeterministic values
to generate the reply.

A post-commit phase is needed for the primary to disseminate
the data in the postnd log to backup replicas
and for all correct replicas to be sure they have received
the same set of values for the corresponding request. Unlike
the pre-prepare-update phase for controlling npre, the post-commit
phase involves all the steps needed for correct
replicas to reach an agreement on the nondeterministic values,
which requires three rounds of message exchanges similar
to those used to determine the ordering of the requests
under normal operations. For npre, the prepare and commit
phases needed for the correct replicas to reach a Byzantine
agreement on the nondeterministic values are integrated
with those for the corresponding request message. We cannot
do so for post-determinable nondeterminism types because
the ordering for the corresponding request has already
been decided.

A backup replica does not deliver a request message until
a Byzantine agreement has been reached on the nondeterministic
values for the request. If the Byzantine agreement
cannot be reached, or the verification of the nondeterministic
values fails, the replica suspects the primary. Furthermore,
when the replica produces a reply for the request,
the digest of the reply is compared with that supplied by 
the primary. If the two do not match, the backup replica 
suspects the primary. Regardless of the comparison result, 
the backup replica sends the reply message to the client. It 
is safe to do so because if all correct backup replicas pro- 
duce the same reply using the same set of nondeterministic 
values (even if they might be different from the set actually 
used by the primary, which implies that the primary is lying 
and will be suspected), the result is valid. 
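
The reply check performed by a backup can be sketched as follows. This is a simplified illustration: `std::hash` stands in for the secure digest a real deployment would use, and `BackupOutcome` and `on_backup_reply` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Stand-in digest. std::hash is NOT cryptographic; it is used here
// only to keep the sketch self-contained.
inline std::size_t reply_digest(const std::string& data) {
    return std::hash<std::string>{}(data);
}

struct BackupOutcome {
    bool suspect_primary;  // digest mismatch: the primary is suspected
    bool reply_sent;       // the reply is sent to the client in any case
};

// Compare the digest of the locally produced reply with the digest the
// primary stored in its postnd log, then send the reply regardless.
BackupOutcome on_backup_reply(const std::string& local_reply,
                              std::size_t primary_reply_digest) {
    BackupOutcome out;
    out.suspect_primary = (reply_digest(local_reply) != primary_reply_digest);
    out.reply_sent = true;
    return out;
}
```

The reply is sent unconditionally because the client only needs matching replies from the correct backups, all of which executed the request with the same agreed-upon values.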

4.4. Controlling NPOST nondeterminism 

The handling of non-verifiable post-determinable nondeterminism
involves the same steps as those described
in the previous subsection until a backup replica is ready to
deliver the request with the post-determined nondeterministic
values, as shown in Figure 3.

The concern here is that a faulty primary could disseminate
a wrong set of nondeterministic values hoping either to
confuse the backup replicas or to block them from providing
useful services to their clients. For example, if the nondeterministic
values contain thread ordering information, a
faulty primary can arrange the ordering in such a way that it
leads to the crash of the backup replicas (e.g., if the primary
knows of a software bug that leads to a segmentation
fault), or it may cause a deadlock at the backup
replicas (it is possible for a replica to perform a deadlock
analysis before it follows the primary's ordering to prevent
this from happening).

Because in general the replica cannot completely verify 
the correctness of the nondeterministic values until it actu- 
ally executes the request, it is important for a backup replica 
to launch a separate monitoring process prior to invoking 
the execute() call. Should the replica run into a deadlock
or a crash failure, the monitoring process can restart
the replica and suspect the primary.

If it can successfully complete the execute() upcall,
the backup replica performs the same reply verification pro- 
cedure as that described in the previous subsection, and 
sends the reply to the client. 

4.5. Proof of Correctness 

We now provide an informal proof of correctness of our 
mechanisms. Due to space limitations, we only argue for the
safety property of our mechanisms and
omit the proof for liveness. Since we do not have space to
elaborate on the view change mechanisms, the proof is further
limited to the safety property within a single view.

Theorem 1. If a correct replica delivers a request m with
a set of nondeterministic data in view v, then no other correct
replica delivers m with a different set of nondeterministic
data, and all such correct replicas use, or record (at
the primary), the same set of nondeterministic data during
their execution of m.

For the vpre type, the nondeterministic data is proposed by
the primary and the agreement on the data is carried out to- 
gether with the request message itself. At the end of the 
three-phase BFT algorithm, if some correct replicas agree 
on the ordering of the request message, they reach an agree- 
ment on the nondeterministic data as well. For npre type, 
the nondeterministic data is collectively determined by the 
pre-prepare-update phase, and it is followed by the three- 
phase BFT agreement. Again, if some correct replicas com- 
mit the request m, they also agree on the associated nonde- 
terministic data. For both vpre and npre types, when the 
request m is delivered at a correct replica, the nondetermin- 
istic data that have been agreed-upon are also delivered and 
used for execution. 

For the vpost and npost types, the agreement on the nondeterministic
data among correct replicas is guaranteed by
the three-phase BFT algorithm executed during the post-commit
phase. When the request m is delivered at a correct
backup, the nondeterministic data associated with m
is also delivered. The primary, if it is correct, must have
recorded the nondeterministic data during its execution of
m, and have disseminated the data to the backups during
the post-commit phase. Therefore, the same nondeterministic
data are used for execution at the primary (if it is correct)
and other correct replicas.

5. Implementation and Performance 

We implemented the core mechanisms described in the 
previous section in C++ and integrated them into the BFT 
framework [7, 8, 9, 10]. The experiments described below
are focused on the evaluation of the cost for providing
Byzantine fault tolerance to nondeterministic applications 
in the BFT layer. The cost associated with recording non- 
deterministic values, verifying such values, and replaying 
such values in the application layer is not studied in this 
work. 

The development and test platform consists of 14 nodes 
running RedHat 8.0 Linux. Four of the 14 computers
are equipped with Pentium-4 2.8GHz processors and
the rest have Pentium-3 1GHz processors. The computers
are connected via a 16-port Netgear 100Mbps switch.
The server replicas run on the four Pentium-4 nodes and the 
clients are distributed across the rest of the nodes. 

Figure 4 shows the summary of the end-to-end latency
and throughput measurements for a client-server applica- 
tion under normal operations for different types of replica 
nondeterminism, including composite types. In each itera- 
tion, each client issues a request to the server replicas and 
waits for the corresponding reply. There is no waiting time 
between consecutive iterations. The size of each request 
and reply is kept fixed at 1KB. In each run, we measure 
the total elapsed time for 10,000 consecutive iterations at 
each client. From the measured time, we derive the average 
end-to-end latency for each request-reply iteration and the 
system throughput. 

The type of nondeterminism and the size of nondeterministic
values vary in different experiments, except during
the throughput measurements, where the nondeterministic
values are kept at 256 bytes for each type. Note that
the sizes of nondeterministic values shown on the horizontal
axis in Figure 4(a) are for each type. That is, for
composite types, the total size of nondeterministic values is
two or three times as large as that displayed.

Except for vpre, the handling of other types of nondeterminism
involves one or more phases of message exchanges
for correct replicas to reach an agreement on the
nondeterministic values. As such, as shown in Figure 4, the
end-to-end latency is noticeably larger, and the throughput
is smaller, than that of vpre nondeterministic operations.
The end-to-end latency difference becomes more significant as the
size of nondeterministic values involved in each operation
increases.

The results shown in Figure 4 are obtained after a number
of optimizations to the mechanisms described previously.
Without these optimizations, the latency is significantly
larger and the throughput is much lower, except
for vpre nondeterministic operations.

In the pre-prepare-update phase, which is needed to handle
npre nondeterminism and other composite types involving
npre nondeterminism, each backup replica
multicasts its contribution of the nondeterministic values
to all other replicas, and the primary decides on the collection
(which must include the contributions from 2f+1 replicas,
including its own) to be used to calculate the final nondeterministic
values. Instead of multicasting the collection of
nondeterministic values, the primary replica disseminates 
the collection of the digests of the values proposed by each 
replica. This sharply reduces the message size if the size 
of nondeterministic values is large. Since each replica can 
log the nondeterministic values received from other repli- 
cas, a (backup) replica can verify the digests provided by 
the primary using its local copies. A backup replica might 
not have received the values proposed by one or more repli- 
cas included in the primary's message, in which case, the 
replica asks for retransmission of the values. 
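
The digest check performed by a backup under this optimization can be sketched as follows. The types and names (`CheckResult`, `check_digests`, integer replica ids) are assumptions for illustration, `std::hash` stands in for a secure digest, and the reaction to a mismatch is left open because it is not detailed above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

enum class CheckResult { Accept, NeedRetransmit, Mismatch };

// Stand-in digest; a real system would use a cryptographic hash.
inline std::size_t value_digest(const std::string& values) {
    return std::hash<std::string>{}(values);
}

// local_log:       replica id -> values received directly from that replica
// primary_digests: replica id -> digest listed in the primary's message
// missing:         filled with ids whose values must be retransmitted
CheckResult check_digests(const std::map<int, std::string>& local_log,
                          const std::map<int, std::size_t>& primary_digests,
                          std::vector<int>& missing) {
    missing.clear();
    for (const auto& entry : primary_digests) {
        auto it = local_log.find(entry.first);
        if (it == local_log.end()) {
            missing.push_back(entry.first);  // values never received locally
            continue;
        }
        if (value_digest(it->second) != entry.second) {
            return CheckResult::Mismatch;    // local copy disagrees with primary
        }
    }
    return missing.empty() ? CheckResult::Accept : CheckResult::NeedRetransmit;
}
```

The primary's message then carries one fixed-size digest per contributor instead of the full values, which is where the reduction in message size comes from.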

During the post-commit phase, which is needed to handle
npost nondeterminism, the data in the postnd log is
piggybacked with the pre_prepare message for the next
request. This way, the Byzantine agreement for the nondeterministic
values is reached together with that for the
ordering of that request, which reduces the number of
messages needed to handle this type of nondeterminism.
Even though the end-to-end latency for a request increases 
slightly as a result, the system throughput is significantly 
improved. To avoid waiting indefinitely for the next request, 
the primary sets a timer. When the timer expires, the pri- 
mary initiates the Byzantine agreement phases for the non- 
deterministic values in conjunction with a null request so 
that the existing mechanisms can be reused. 
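
The piggybacking logic at the primary can be sketched with a logical clock as follows. The class and member names (`PostndPiggyback`, `record`, `on_next_request`, `on_tick`) and the tick-based timer are assumptions made for this illustration; an empty request string models the null request.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct PrePrepare {
    std::string request;              // empty string models the null request
    std::vector<std::string> postnd;  // piggybacked postnd log entries
};

class PostndPiggyback {
public:
    explicit PostndPiggyback(long timeout_ticks) : timeout_(timeout_ticks) {}

    // Called after the primary executes a request at logical time `now`.
    // The timer is armed when the first pending entry is recorded.
    void record(const std::string& entry, long now) {
        if (pending_.empty()) deadline_ = now + timeout_;
        pending_.push_back(entry);
    }

    // Called when the next client request is ready to be ordered:
    // all pending entries ride along with its pre_prepare message.
    PrePrepare on_next_request(const std::string& request) {
        PrePrepare msg{request, pending_};
        pending_.clear();
        return msg;
    }

    // Called periodically. If the timer has expired while entries are
    // still pending, fills `msg` with a null request carrying them.
    bool on_tick(long now, PrePrepare& msg) {
        if (pending_.empty() || now < deadline_) return false;
        msg = PrePrepare{std::string(), pending_};
        pending_.clear();
        return true;
    }

private:
    long timeout_;
    long deadline_ = 0;
    std::vector<std::string> pending_;
};
```

Under a steady load, `on_next_request` is reached before the deadline almost every time, so few separate agreement rounds are needed for the post-determined values.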



It may be surprising to see that the end-to-end latency
for a request with npre nondeterminism is similar to, or
slightly larger than, that for a request with npost nondeterminism
when there is a large quantity of nondeterministic
values. With the above optimization, the pre-prepare-update
phase (needed to handle npre) involves at least
two large messages (one message per backup replica on its
proposed nondeterministic values) while the post-commit
phase (needed to handle npost) involves only one
large message (sent by the primary). For the same reason,
the throughput for requests with npost nondeterminism
is higher than for those with npre nondeterminism when
a sufficient number of concurrent clients are present (so that
virtually all post-determinable nondeterministic values are
piggybacked with the pre_prepare messages for other requests,
rather than being sent as separate messages).

6. Related Work 

Replica nondeterminism has been studied extensively
under the benign fault model [17] [20]. However, there is a lack of systematic classification
of the common types of replica nondeterminism, and
even less so on the unified handling of such nondetermin- 
ism. I4ll5l [l6ll did provide a classification of some types of 
replica nondeterminism. However, they largely fall within 
the types of wrappable nondeterminism and verifiable pre- 
determinable nondeterminism, with the exception of nonde- 
terminism caused by asynchronous interrupts, which we do 
not address in this work. 

The replica nondeterminism caused by multithreading
has been studied separately from other types of nondeterminism,
again, under the benign fault model only, in
12 |3[l2l [13 [H US. However, these studies provided valuable
insight on how to approach the problem of ensuring
consistent replication of multithreaded applications. It is
recognized that what matters in achieving replica consistency
is to control the ordering of different threads' accesses to
shared data. Mechanisms to record and to replay such
ordering have been developed, as have those for checkpointing
and restoring the state of multithreaded applications (for
example, 0T|). Even though these mechanisms alone are
not sufficient to achieve Byzantine fault tolerance for multithreaded
applications, they can be adapted and used towards
this goal. In this paper, we have shown when to record and
(partially) verify the ordering, how to propagate the ordering,
and how to provision for problems encountered when
replaying the ordering, all under the Byzantine fault model.

Under the Byzantine fault model, the main effort on the
subject of replica nondeterminism control so far has been to cope
with wrappable and verifiable pre-determinable replica nondeterminism
[7] [8] [9] [10]. In 0, Castro and Liskov provided
a brief guideline on how to deal with the type of nondeterminism
that requires collective determination of the
nondeterministic values. The guideline is important
and useful, and we have followed it in this work. However, the
guideline is applicable to only a subset of the problems we
have addressed.

7. Conclusion and Future Work 

In this paper, we presented a classification of common 
types of replica nondeterminism, and the mechanisms nec- 
essary to handle them in the context of Byzantine fault 
tolerance. We also described how to integrate our mechanisms
into a well-known BFT framework [7] [8] [9] [10].
Furthermore, we conducted extensive experiments to eval- 
uate the performance of the BFT framework extended with 
our mechanisms. We show that our mechanisms only incur 
moderate runtime overhead. 

Future work will focus on the development of modules 
and tools that help applications record, verify (if applica- 
ble), and replay nondeterministic values. 